SQL PROJECT - SPATIAL DATA ANALYSIS WITH POSTGIS

Author

Duc Luong

Published

December 3, 2023

1 Introduction

About this project:

As I am new to spatial data analysis, this project will only focus on myself following the instructions and exercises provided by the Introduction to PostGIS workshop. With that said, I will attempt to come up with my own questions and scenarios to apply the newly acquired knowledge. This is an ongoing project and will be updated regularly until the workshop is completed.

About the data:

The data for this workshop is four shapefiles for New York City, and one attribute table of sociodemographic variables.

  1. nyc_census_blocks
Column Description
blkid A 15-digit code that uniquely identifies every census block. Eg: 360050001009000
popn_total Total number of people in the census block
popn_white Number of people self-identifying as “White” in the block
popn_black Number of people self-identifying as “Black” in the block
popn_nativ Number of people self-identifying as “Native American” in the block
popn_asian Number of people self-identifying as “Asian” in the block
popn_other Number of people self-identifying with other categories in the block
boroname Name of the New York borough. Manhattan, The Bronx, Brooklyn, Staten Island, Queens
geom Polygon boundary of the block
  1. nyc_neighborhoods
Column Description
name Name of the neighborhood
boroname Name of the New York borough. Manhattan, The Bronx, Brooklyn, Staten Island, Queens
geom Polygon boundary of the neighborhood
  1. nyc_streets
Column Description
name Name of the street
oneway Is the street one-way? “yes” = yes, “” = no
type Road type (primary, secondary, residential, motorway)
geom Linear centerline of the street
  1. nyc_subway_stations
Column Description
name Name of the station
borough Name of the New York borough. Manhattan, The Bronx, Brooklyn, Staten Island, Queens
routes Subway lines that run through this station
transfers Lines you can transfer to via this station
express Stations where express trains stop, “express” = yes, “” = no
geom Point location of the station
  1. nyc_census_sociodata
Column Description
tractid An 11-digit code that uniquely identifies every census tract. (“36005000100”)
transit_total Number of workers in the tract
transit_private Number of workers in the tract who use private automobiles / motorcycles
transit_public Number of workers in the tract who take public transit
transit_walk Number of workers in the tract who walk
transit_other Number of workers in the tract who use other forms like walking / biking
transit_none Number of workers in the tract who work from home
transit_time_mins Total number of minutes spent in transit by all workers in the tract (minutes)
family_count Number of families in the tract
family_income_median Median family income in the tract (dollars)
family_income_mean Average family income in the tract (dollars)
family_income_aggregate Total income of all families in the tract (dollars)
edu_total Number of people with educational history
edu_no_highschool_dipl Number of people with no high school diploma
edu_highschool_dipl Number of people with high school diploma and no further education
edu_college_dipl Number of people with college diploma and no further education
edu_graduate_dipl Number of people with graduate school diploma

2 Loading spatial data

2.1 Creating a spatial database in PostgreSQL

Select the database. Load PostGIS spatial extension to the database in PgAdmin using the following query.

CREATE EXTENSION postgis;

Confim that PostGIS is installed

SELECT postgis_full_version();

2.2 Importing spatial data with OGR2OGR

Since I already have QGIS installed, I will import spatial data through OGR2OGR.

In OSGeo4W Shell, navigate to the folder containing data to import and type:

ogr2ogr -nln nyc_census_blocks_2000 -nlt PROMOTE_TO_MULTI -lco GEOMETRY_NAME=geom -lco FID=gid -lco PRECISION=NO Pg:"dbname=nyc host=localhost user=postgres password=******** port=5432" nyc_census_blocks_2000.shp

Remember to change the username and password.

Note

Alternatively, data can be restored using .backup files in PgAdmin. To do so, right click on the database, then select Restore…

2.3 Trying to connect and view data with QGIS

Open QGIS, select Layer > Add Layer > Add PostGIS Layers

Click the New button, then fill out Name (nyc), Host (localhost), Database (nyc). Click OK to connect.

Once connected, select tables to add, then click Add.

Select layers to display from the Layers panel.

3 Running non-spatial queries

I will answer some questions about the datasets using SQL.

  1. How many records are there in the nyc_census_block?

    SELECT COUNT(*)
    FROM nyc_census_blocks;

    Output:

    "count"
    38794
  2. What is the total population of New York?

    SELECT SUM(popn_total) AS total_population
    FROM nyc_census_blocks;

    Output:

    "total_population"
    8175032
  3. What is the population of each borough of New York?

    SELECT boroname, SUM(popn_total)
    FROM nyc_census_blocks
    GROUP BY boroname;

    Output:

    | "boroname"      | "sum"   |
    |-----------------|---------|
    | "Queens"        | 2230621 |
    | "Brooklyn"      | 2504700 |
    | "The Bronx"     | 1385108 |
    | "Manhattan"     | 1585873 |
    | "Staten Island" | 468730  |
  4. Summarize demographic data from nyc_census_block table.

    SELECT
      boroname,
      SUM(popn_total) AS total_pop,
      ROUND(((SUM(popn_white) / SUM(popn_total)) * 100)::numeric, 2) AS white_pop_pct,
      ROUND(((SUM(popn_black) / SUM(popn_total)) * 100)::numeric, 2) AS black_pop_pct,
      ROUND(((SUM(popn_asian) / SUM(popn_total)) * 100)::numeric, 2) AS asian_pop_pct,
      ROUND(((SUM(popn_nativ) / SUM(popn_total)) * 100)::numeric, 2) AS native_pop_pct,
      ROUND(((SUM(popn_other) / SUM(popn_total)) * 100)::numeric, 2) AS others_pop_pct
    FROM nyc_census_blocks
    GROUP BY boroname
    ORDER BY 2 DESC;

    Output:

    | "boroname"      | "total_pop" | "white_pop_pct" | "black_pop_pct" | "asian_pop_pct" | "native_pop_pct" | "others_pop_pct" |
    |-----------------|-------------|-----------------|-----------------|-----------------|------------------|------------------|
    | "Brooklyn"      | 2504700     | 42.80           | 34.34           | 10.47           | 0.54             | 11.85            |
    | "Queens"        | 2230621     | 39.72           | 19.13           | 22.94           | 0.69             | 17.51            |
    | "Manhattan"     | 1585873     | 57.45           | 15.56           | 11.32           | 0.55             | 15.13            |
    | "The Bronx"     | 1385108     | 27.90           | 36.47           | 3.58            | 1.32             | 30.72            |
    | "Staten Island" | 468730      | 72.89           | 10.64           | 7.50            | 0.36             | 8.61             |
  5. How many subway stations are there in each borough in New York? Include a total number of stations to each row for comparison.

    SELECT
      borough,
      (SELECT COUNT(*) FROM nyc_subway_stations) AS total_ny_stations,
      COUNT(name) AS stations_count
    FROM nyc_subway_stations
    GROUP BY borough;

    Output:

    | "borough"       | "total_ny_stations" | "stations_count" |
    |-----------------|---------------------|------------------|
    | "Queens"        | 491                 | 90               |
    | "Brooklyn"      | 491                 | 170              |
    | "Staten Island" | 491                 | 22               |
    | "Manhattan"     | 491                 | 140              |
    | "Bronx"         | 491                 | 69               |
  6. Calculate the people per station metric for each borough.

    -- Define a temporary table containing the number of stations per borough
    WITH stations_cte AS (
      SELECT 
        borough,
        COUNT(*) AS total_stations,
        COUNT(CASE WHEN express = 'express' THEN 1 END) AS exp_stations
      FROM nyc_subway_stations
      GROUP BY borough
    )
    
    SELECT 
      boroname,
    
      -- Calculate average population of New York
      AVG((SELECT SUM(popn_total) FROM nyc_census_blocks)) AS ny_populations,
    
      -- Calculate total population for each borough
      SUM(popn_total) AS boro_populations,
    
      -- Calculate average number of all stations per borough
      AVG(total_stations)::int AS stations_count,
    
      -- Calculate average people per all station
      ROUND(SUM(popn_total) / AVG(total_stations)) AS people_per_station,
    
      -- Calculate average number of express stations per borough
      AVG(exp_stations)::int AS express_stations_count,
    
      -- Calculate average people per express station (handle division by zero)
      ROUND(SUM(popn_total) / NULLIF(AVG(exp_stations), 0)) AS people_per_exp_station
    
    FROM nyc_census_blocks ncb
    LEFT JOIN stations_cte sc ON ncb.boroname = sc.borough
    GROUP BY boroname
    ORDER BY boro_populations DESC;

    Output:

    | "boroname"      | "ny_populations" | "boro_populations" | "stations_count" | "people_per_station" | "express_stations_count" | "people_per_exp_station" |
    |-----------------|------------------|--------------------|------------------|----------------------|--------------------------|--------------------------|
    | "Brooklyn"      | 8175032          | 2504700            | 170              | 14734                | 31                       | 80797                    |
    | "Queens"        | 8175032          | 2230621            | 90               | 24785                | 14                       | 159330                   |
    | "Manhattan"     | 8175032          | 1585873            | 140              | 11328                | 62                       | 25579                    |
    | "The Bronx"     | 8175032          | 1385108            |                  |                      |                          |                          |
    | "Staten Island" | 8175032          | 468730             | 22               | 21306                | 0                        |                          |

4 Running geometries functions

Answer some questions about New York city streets using geometries functions.

  1. What is the total length of all streets in New York in kilometers?

    SELECT ROUND(SUM(ST_Length(geom))::numeric / 1000, 2) AS total_length_km
    FROM nyc_streets;

    Output:

    "total_length_km"
    10418.90
  2. How many types of streets are there and what is the total length of each type?

    SELECT 
      type,
      ROUND(SUM(ST_Length(geom))::numeric, 2) AS length_m
    FROM nyc_streets
    GROUP BY type
    ORDER BY length_m DESC;

    Output:

    | "type"                                             | "length_m" |
    |----------------------------------------------------|------------|
    | "residential"                                      | 8629870.34 |
    | "motorway"                                         | 403622.48  |
    | "tertiary"                                         | 360394.88  |
    | "motorway_link"                                    | 294261.42  |
    | "secondary"                                        | 276264.30  |
    | "unclassified"                                     | 166936.37  |
    | "primary"                                          | 135034.23  |
    | "footway"                                          | 71798.49   |
    | "service"                                          | 28337.64   |
    | "trunk"                                            | 20353.58   |
    | "cycleway"                                         | 8863.75    |
    | "pedestrian"                                       | 4867.05    |
    | "construction"                                     | 4803.08    |
    | "residential; motorway_link"                       | 3661.58    |
    | "trunk_link"                                       | 3202.19    |
    | "primary_link"                                     | 2492.57    |
    | "living_street"                                    | 1894.64    |
    | "primary; residential; motorway_link; residential" | 1367.77    |
    | "undefined"                                        | 380.54     |
    | "steps"                                            | 282.75     |
    | "motorway_link; residential"                       | 215.08     |
  3. Calculate some summary statistics about New York streets.

    SELECT
      ROUND(SUM(ST_Length(geom))) AS total_length_m,
      ROUND(AVG(ST_Length(geom))::numeric, 2) AS avg_length,
      ROUND(percentile_cont(.5) 
          WITHIN GROUP (ORDER BY ST_Length(geom))::numeric, 2) 
          AS median_length,
      ROUND(MAX(ST_Length(geom))) AS longest_length,
      (
        SELECT name 
        FROM nyc_streets 
        ORDER BY ST_Length(geom) DESC
        LIMIT 1
      ) AS longest_street
    FROM nyc_streets;

    Output:

    | "total_length_m" | "avg_length" | "median_length" | "longest_length" | "longest_street"  |
    |------------------|--------------|-----------------|------------------|-------------------|
    | 10418905         | 545.75       | 269.11          | 14918            | "Leif Ericson Dr" |
  4. Find the named street of the northernmost point in New York.

    Firstly, I need to find the northermost point in New York. The northermost point should be a coordinate with the highest lattitude. A street is a multilinestring type that contains multiple points. Therefore, my approach is to extract all points from all street geometries, then extract the lattitude and find the highest one. After the lattitude is found, I will find the name of the street by filtering street coordinates that contain this lattitude.

    • First query: find the highest lattitude of streets in New York.
    -- Dump all points from streets with non-null name.
    WITH dump_points_cte AS
      (SELECT ST_DumpPoints(geom) AS dp
      FROM nyc_streets
      WHERE name IS NOT NULL)
    
    -- Extract the highest lattitude point, i.e the northernmost point.
    SELECT ST_Y((dp).geom) AS lattitude
    FROM dump_points_cte
    ORDER BY 1 DESC
    LIMIT 1;

    Output:

    "lattitude"
    4529895.358189691
    • Second query: filter the street name based on the lattitude found earlier.
    -- Find the non-null street that contains the highest lattitude point.
    SELECT name, ST_AsText(geom)
    FROM nyc_streets
    WHERE ST_AsText(geom) LIKE '%4529895.358189691%'
      AND name IS NOT NULL;

    Output:

    | "name"     | "st_astext"                                                                                  |
    |------------|----------------------------------------------------------------------------------------------|
    | "River Rd" | "MULTILINESTRING((591838.3024050258 4529895.358189691,591831.6719192386 4529886.060201272))" |
    • Third query: transform one point of the linestring from SRID 26918 to SRID 4326 to find the coordinates on Google Maps.
    -- The WKT point coordinates are taken from one of the coordinates in the "st_astext" column.
    SELECT ST_AsText(ST_Transform(ST_GeomFromText('POINT(591838.3024050258 4529895.358189691)',26918),4326)) AS wgs_geom;

    Output:

    "wgs_geom"
    "POINT(-73.90941589999974 40.91501400000343)"

    Reverse the order of the coordinates as 40.91501400000343 -73.90941589999974 (latitude, longitude), then search on Google Maps to double check.

The northernmost street of New York. Click the image to navigate to the exact coordinates on Google Maps.

5 Spatial relationships and spatial joins

Let’s imagine a scenario involving my friend, Fred. Fred is embarking on a month-long business trip to New York City and will be commuting to Wall Street regularly. His workplace is the renowned New York stock exchange (NYSE). He has asked for my assistance in identifying the perfect neighborhood for him to find accommodation. His key criteria include:

  1. The chosen neighborhood must be within a 5 km radius of Wall Street, and it must not be located in Manhattan borough.
  2. It should have at least 10 subway stations.
  3. Rank the neighborhoods based on population density in the ascending order. Fred prefers less populated areas.

5.1 Preparation

Let’s make some preparations before solving each of the criteria above.

Firstly, as there are three different Wall St in New York, I need to identify the correct Wall St in Manhattan. To do so, the “Wall St” I looking for should be within a 300 m radius of NYSE.

From Google Maps, I found the coordinate values of NYSE to be: 40.70723482298096, -74.01117110032183. My job is to transform it into SRID 26918 so that I can compare it with geometries within my dataset.

-- Transform the coordinates of NYSE to SRID 26918.
SELECT ST_AsText(
  ST_Transform(
    ST_GeomFromText('POINT(-74.01117110032183 40.70723482298096)', 4326), 
    26918
  )
)

Output:

'st_astext'
'POINT(583529.537283629 4506728.297607037)'

Next, run a query to find all streets with the name ‘Wall St’. Then filter streets that are within a 300 m radius of NYSE.

SELECT 
  gid, 
  name,
  ST_NPoints(geom),
  ST_Length(geom)
FROM nyc_streets
WHERE name LIKE 'Wall St%'
AND ST_DWithin(geom, ST_GeomFromText('POINT(583529.537283629 4506728.297607037)', 26918), 300)

Output:

| 'gid' | 'name'    | 'st_npoints' | 'st_length'       |
|-------|-----------|--------------|-------------------|
| 17385 | 'Wall St' | 9            | 603.2621123790169 |

Create a table to store the geometry to use in future queries.

CREATE TABLE wall_st AS
    SELECT geom AS wall_st_geom
    FROM nyc_streets
    WHERE gid = '17385';

5.2 Identifying the perfect neighborhood

  1. Find all neighborhood within a 5 km radius of Wall Street but doesn’t belong to Manhattan borough

    It is now much more convenient to find the required neighborhoods.

    SELECT name, boroname
    FROM nyc_neighborhoods
    WHERE ST_DWithin(
      geom, 
      (SELECT wall_st_geom FROM wall_st),
      5000
    )
    AND boroname != 'Manhattan'

    This query results in 11 neighborhoods, all in Brooklyn.

    | 'name'               | 'boroname' |
    |----------------------|------------|
    | 'Red Hook'           | 'Brooklyn' |
    | 'Boerum Hill'        | 'Brooklyn' |
    | 'Downtown'           | 'Brooklyn' |
    | 'Greenwood'          | 'Brooklyn' |
    | 'Carroll Gardens'    | 'Brooklyn' |
    | 'Cobble Hill'        | 'Brooklyn' |
    | 'Fort Green'         | 'Brooklyn' |
    | 'Flatbush'           | 'Brooklyn' |
    | 'Park Slope'         | 'Brooklyn' |
    | 'Williamsburg'       | 'Brooklyn' |
    | 'Bedford-Stuyvesant' | 'Brooklyn' |
  2. Filter neighborhoods with at least 10 subway stations

    First, create a table to store the geometries of the 11 neighborhoods to make the future queries easier to read.

    CREATE TABLE selected_neighborhoods AS
        (SELECT *
        FROM nyc_neighborhoods
        WHERE ST_DWithin(
          geom, 
          (SELECT wall_st_geom FROM wall_st),
          5000
        )
        AND boroname != 'Manhattan');
    
    SELECT * FROM selected_neighborhoods;

    Output:

    | 'gid' | 'boroname' | 'name'               | 'geom'           |
    |-------|------------|----------------------|------------------|
    | 125   | 'Brooklyn' | 'Red Hook'           | '01060000202...' |
    | 28    | 'Brooklyn' | 'Boerum Hill'        | '01060000202...' |
    | 34    | 'Brooklyn' | 'Downtown'           | '01060000202...' |
    | 62    | 'Brooklyn' | 'Greenwood'          | '01060000202...' |
    | 118   | 'Brooklyn' | 'Carroll Gardens'    | '01060000202...' |
    | 29    | 'Brooklyn' | 'Cobble Hill'        | '01060000202...' |
    | 94    | 'Brooklyn' | 'Fort Green'         | '01060000202...' |
    | 47    | 'Brooklyn' | 'Flatbush'           | '01060000202...' |
    | 54    | 'Brooklyn' | 'Park Slope'         | '01060000202...' |
    | 59    | 'Brooklyn' | 'Williamsburg'       | '01060000202...' |
    | 116   | 'Brooklyn' | 'Bedford-Stuyvesant' | '01060000202...' |

    Run a query to filter the neighborhoods with at least 10 subway stations.

    SELECT sn.name, COUNT(nss.geom)
    FROM nyc_subway_stations AS nss
    JOIN selected_neighborhoods AS sn
    ON ST_Contains(sn.geom, nss.geom)
    GROUP BY sn.name
    HAVING COUNT(nss.geom) >= 10;

    Output:

    | 'gid' | 'name'               | 'count' |
    |-------|----------------------|---------|
    | 59    | 'Williamsburg'       | 13      |
    | 94    | 'Fort Green'         | 13      |
    | 116   | 'Bedford-Stuyvesant' | 13      |
  3. Rank the selected neighborhoods based on population density ascending.

    SELECT 
      sn.name, 
      ROUND((ST_Area(sn.geom) / 1000000)::numeric, 2) AS area_km2, 
      SUM(ncb.popn_total) AS population, 
      ROUND(SUM(ncb.popn_total) / (ST_Area(sn.geom)/1000000)) AS pop_density_km2,
      RANK() OVER (ORDER BY (SUM(ncb.popn_total) / (ST_Area(sn.geom)/1000000)) ASC)
    FROM selected_neighborhoods AS sn
    LEFT JOIN nyc_census_blocks AS ncb
    ON ST_Intersects(sn.geom, ncb.geom)
    WHERE sn.gid IN ('59', '94', '116')
    GROUP BY sn.name, ST_Area(sn.geom);

    Output:

    | 'name'               | 'area_km2' | 'population' | 'pop_density_km2' | 'rank' |
    |----------------------|------------|--------------|--------------------|--------|
    | 'Williamsburg'       | 7.61       | 135157       | 17763              | 1      |
    | 'Fort Green'         | 5.83       | 125039       | 21451              | 2      |
    | 'Bedford-Stuyvesant' | 10.78      | 242871       | 22537              | 3      |

    Looks like Williamsburg is the best fit for my friend. However, Fort Green and Bed-Stuy also look very promising.

    Williamsburg neighborhood on Google Maps.
Caution

The calculated area, population, and population density in this project are intended for demonstration purposes only. Results may significantly differ from data available online. Use with caution.